Classification using Dirichlet priors when the training data are mislabeled

Authors

Robert S. Lynch

Peter Willett

Abstract

The average probability of error is used to demonstrate the performance of a Bayesian classification test, referred to as the Combined Bayes Test (CBT), when the training data of each class are mislabeled. The CBT combines the information in discrete training and test data to infer symbol probabilities, where a uniform Dirichlet prior (i.e., a noninformative prior of complete ignorance) is assumed for all classes. Using this prior, it is shown that classification performance degrades when mislabeling exists in the training data, with a severity that depends on the values of the mislabeling probabilities. However, an increase in the mislabeling probabilities is also shown to cause an increase in M (i.e., the best quantization fineness). Further, even when the actual mislabeling probabilities are known by the CBT, it is not possible to achieve the classification performance obtainable without mislabeling.
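The idea of inferring symbol probabilities from discrete counts under a uniform Dirichlet prior can be illustrated with a small sketch. This is not the authors' CBT as published. It is a generic Dirichlet-multinomial predictive classifier, written under the assumption that each class is summarized by a vector of symbol counts and that every prior parameter equals 1. The function names, the three-symbol alphabet, and the example counts are all hypothetical and chosen only for illustration.

```python
from math import lgamma, exp


def log_predictive(train_counts, test_counts):
    """Log probability of an ordered test sequence given training counts.

    Assumes a uniform Dirichlet prior (all parameters equal to 1) over the
    symbol probabilities, which are integrated out analytically.
    """
    m = len(train_counts)          # number of discrete symbols
    n_train = sum(train_counts)
    n_test = sum(test_counts)
    total = lgamma(n_train + m) - lgamma(n_train + n_test + m)
    for x, y in zip(train_counts, test_counts):
        total += lgamma(x + y + 1) - lgamma(x + 1)
    return total


def classify(train_by_class, test_counts):
    """Return the label whose training counts best predict the test counts."""
    return max(
        train_by_class,
        key=lambda label: log_predictive(train_by_class[label], test_counts),
    )


def log_margin(train_by_class, test_counts, label_a="A", label_b="B"):
    """Log-likelihood ratio in favour of label_a over label_b."""
    return (log_predictive(train_by_class[label_a], test_counts)
            - log_predictive(train_by_class[label_b], test_counts))


# Hypothetical training counts over three symbols, ten samples per class.
clean = {"A": [8, 1, 1], "B": [1, 1, 8]}
# Same data after two samples per class have swapped labels (illustrative).
mislabeled = {"A": [6, 1, 3], "B": [3, 1, 6]}

test_counts = [3, 0, 0]  # three test observations of the first symbol

if __name__ == "__main__":
    print(classify(clean, test_counts))
    print(log_margin(clean, test_counts), log_margin(mislabeled, test_counts))
```

In this toy example both training sets still favour class A, but the log-likelihood margin is smaller with the mislabeled counts, which is consistent in spirit with the degradation the abstract describes.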

Similar resources

An algorithm for correcting mislabeled data

Reliable evaluation for the performance of classifiers depends on the quality of the data sets on which they are tested. During the collecting and recording of a data set, however, some noise may be introduced into the data, especially in various real-world environments, which can degrade the quality of the data set. In this paper, we present a novel approach, called ADE (automatic data enhance...

Improving Automated Land Cover Mapping by Identifying and Eliminating Mislabeled Observations from Training Data

This paper presents a new approach to identifying and eliminating mislabeled training samples. The goal of this technique is to decrease the error of classification algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed through the filter...

Robustness of compound Dirichlet priors for Bayesian inference of branch lengths.

We modified the phylogenetic program MrBayes 3.1.2 to incorporate the compound Dirichlet priors for branch lengths proposed recently by Rannala, Zhu, and Yang (2012. Tail paradox, partial identifiability and influential priors in Bayesian branch length inference. Mol. Biol. Evol. 29:325-335.) as a solution to the problem of branch-length overestimation in Bayesian phylogenetic inference. The co...

A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer

MOTIVATION An accurate diagnostic and prediction will not be achieved unless the disease subtype status for every training sample used in the supervised learning step is accurately known. Such an assumption requires the existence of a perfect tool for disease diagnostic and classification, which is seldom available in the majority of the cases. Thus, the supervised learning step has to be condu...

Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families

A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It ...

Publication date: 1999